I am a data analyst at a supermarket and have been given a dataset. My task is to deliver business insights on customer behaviour and to build a model that divides the customer base into well-defined segments, so that we can run segment-specific advertisements, marketing strategies, and so on.
https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
The file is tab-separated rather than comma-separated, so reading it with the default delimiter failed; passing delimiter="\t" fixes this.
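If the separator is unknown, it can also be detected programmatically. A minimal sketch with the standard library's csv.Sniffer, run on a small in-memory sample (the rows below are illustrative, not the real file):

```python
import csv

# A tiny tab-separated sample standing in for the real marketing_campaign.csv
sample = "ID\tYear_Birth\tEducation\n5524\t1957\tGraduation\n2174\t1954\tGraduation\n"

# csv.Sniffer inspects raw text and picks the delimiter from the candidates given
dialect = csv.Sniffer().sniff(sample, delimiters="\t,;")
print(repr(dialect.delimiter))  # -> '\t'
```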
data= pd.read_csv("marketing_campaign.csv", delimiter="\t")
data
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
2240 rows × 29 columns
Removing Unnecessary Columns
drop_cols= ["Response", "Z_Revenue", "Z_CostContact", "Complain"]
for col in drop_cols:
print(data[col].unique())
[1 0]
[11]
[3]
[0 1]
Columns "Z_Revenue" and "Z_CostContact" have only one unique value each, so they carry no information and can be removed.
The columns "Complain" and "Response" are also not required for segmentation.
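The same check can be automated. A sketch (on a toy frame mimicking this dataset) that finds single-valued columns with nunique():

```python
import pandas as pd

# Toy frame mimicking the constant Z_ columns in this dataset
df = pd.DataFrame({
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
    "Income": [58138.0, 46344.0, 71613.0],
})

# A column with a single unique value carries no information for segmentation
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # -> ['Z_CostContact', 'Z_Revenue']
```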
data.drop(drop_cols, axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
dtypes: float64(1), int64(21), object(3)
memory usage: 437.6+ KB
The "Income" column has 24 missing values (2216 of 2240 entries are non-null) that need to be removed.
data.dropna(axis=0, inplace=True)
print("The length of the dataset after removing the missing values is now: "+ str(len(data)))
The length of the dataset after removing the missing values is now: 2216
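Dropping 24 of 2240 rows is harmless here, but as an alternative sketch, the missing incomes could instead be imputed with the median (toy data below, not the real file):

```python
import pandas as pd

# Toy series with a gap, standing in for the 24 missing Income values
df = pd.DataFrame({"Income": [58138.0, None, 71613.0, 26646.0]})

# The median is robust to the long right tail typical of income data
median_income = df["Income"].median()
df["Income"] = df["Income"].fillna(median_income)
print(df["Income"].isna().sum())  # -> 0
```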
data["Age"] = 2023 - data["Year_Birth"]  # 2023 = the year this notebook was written
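Note that the hardcoded 2023 ties Age to when the notebook was run. A hedged alternative is to compute the age at enrollment from Dt_Customer instead (illustrative frame below):

```python
import pandas as pd

df = pd.DataFrame({
    "Year_Birth": [1957, 1984],
    "Dt_Customer": pd.to_datetime(["2012-09-04", "2014-02-10"]),
})

# Age at the time each customer enrolled, instead of relative to a hardcoded year
df["Age_at_enrollment"] = df["Dt_Customer"].dt.year - df["Year_Birth"]
print(df["Age_at_enrollment"].tolist())  # -> [55, 30]
```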
The columns "MntWines", "MntFruits", etc. record the amount a customer spent on each specific product category.
data.columns
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Age'],
dtype='object')
data.columns= ['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kids home',
'Teens home', 'Dt_Customer', 'Last Bought', 'Wines', 'Fruits',
'Meat', 'Fish', 'Sweets',
'Gold', 'Deals Buying', 'Online buyings',
'Catalog Buying', 'Store buyings', 'Online Visits monthly',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Age']
data["Children"]=data["Kids home"]+data["Teens home"]
data["Total_Spent"]= data["Wines"]+ data["Fruits"]+data["Meat"]+ data["Fish"] + data["Sweets"]+ data["Gold"]
data.dtypes
ID                         int64
Year_Birth                 int64
Education                 object
Marital_Status            object
Income                   float64
Kids home                  int64
Teens home                 int64
Dt_Customer               object
Last Bought                int64
Wines                      int64
Fruits                     int64
Meat                       int64
Fish                       int64
Sweets                     int64
Gold                       int64
Deals Buying               int64
Online buyings             int64
Catalog Buying             int64
Store buyings              int64
Online Visits monthly      int64
AcceptedCmp3               int64
AcceptedCmp4               int64
AcceptedCmp5               int64
AcceptedCmp1               int64
AcceptedCmp2               int64
Age                        int64
Children                   int64
Total_Spent                int64
dtype: object
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%d-%m-%Y")  # the raw dates are day-first, e.g. 24-01-2014
data.dtypes
ID                               int64
Year_Birth                       int64
Education                       object
Marital_Status                  object
Income                         float64
Kids home                        int64
Teens home                       int64
Dt_Customer             datetime64[ns]
Last Bought                      int64
Wines                            int64
Fruits                           int64
Meat                             int64
Fish                             int64
Sweets                           int64
Gold                             int64
Deals Buying                     int64
Online buyings                   int64
Catalog Buying                   int64
Store buyings                    int64
Online Visits monthly            int64
AcceptedCmp3                     int64
AcceptedCmp4                     int64
AcceptedCmp5                     int64
AcceptedCmp1                     int64
AcceptedCmp2                     int64
Age                              int64
Children                         int64
Total_Spent                      int64
dtype: object
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kids home | Teens home | Dt_Customer | Last Bought | Wines | ... | Store buyings | Online Visits monthly | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-04-09 | 58 | 635 | ... | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 66 | 0 | 1617 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-08-03 | 38 | 11 | ... | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 69 | 2 | 27 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | ... | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 58 | 0 | 776 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-10-02 | 26 | 11 | ... | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 39 | 1 | 53 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | ... | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 422 |
5 rows × 28 columns
data["Marital_Status"].unique()
array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
'Absurd', 'YOLO'], dtype=object)
Here, values like "Together" depict couples who are not married, while values such as "Divorced", "Widow", "Alone", "Absurd", and "YOLO" effectively mean single.
I will use mapping for this: first build a dictionary of corresponding values, then map it onto the column.
# Avoid naming the mapping `dict`, which would shadow the built-in
status_map = {'Single': "Alone", 'Together': "Partner", 'Married': "Partner", 'Divorced': "Alone",
              'Widow': "Alone", 'Alone': "Alone", 'Absurd': "Alone", 'YOLO': "Alone"}
data["Companion"] = data["Marital_Status"].map(status_map)
data.drop("Marital_Status", axis=1, inplace=True)
Now we have a column that tells whether a customer has a partner. We can also add a column indicating whether a customer has children.
data["has_kids"] = (data["Children"] > 0).astype(int)  # vectorized: 1 if the customer has any children
data.head()
| ID | Year_Birth | Education | Income | Kids home | Teens home | Dt_Customer | Last Bought | Wines | Fruits | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | Companion | has_kids | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | 58138.0 | 0 | 0 | 2012-04-09 | 58 | 635 | 88 | ... | 0 | 0 | 0 | 0 | 0 | 66 | 0 | 1617 | Alone | 0 |
| 1 | 2174 | 1954 | Graduation | 46344.0 | 1 | 1 | 2014-08-03 | 38 | 11 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 69 | 2 | 27 | Alone | 1 |
| 2 | 4141 | 1965 | Graduation | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 0 | 0 | 0 | 0 | 0 | 58 | 0 | 776 | Partner | 0 |
| 3 | 6182 | 1984 | Graduation | 26646.0 | 1 | 0 | 2014-10-02 | 26 | 11 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 39 | 1 | 53 | Partner | 1 |
| 4 | 5324 | 1981 | PhD | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 422 | Partner | 1 |
5 rows × 29 columns
data["Companion"].value_counts()
Partner    1430
Alone       786
Name: Companion, dtype: int64
print("Datapoints with value 'Alone' make up only "
      + str(np.round((len(data[data["Companion"]=="Alone"])/len(data))*100, 2))
      + "% of the total datapoints")
Datapoints with value 'Alone' make up only 35.47% of the total datapoints
data_alone= data[data["Companion"]=="Alone"]
data_partner= data[data["Companion"]=="Partner"]
data_partner.head(4)
| ID | Year_Birth | Education | Income | Kids home | Teens home | Dt_Customer | Last Bought | Wines | Fruits | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | Companion | has_kids | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 4141 | 1965 | Graduation | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 0 | 0 | 0 | 0 | 0 | 58 | 0 | 776 | Partner | 0 |
| 3 | 6182 | 1984 | Graduation | 26646.0 | 1 | 0 | 2014-10-02 | 26 | 11 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 39 | 1 | 53 | Partner | 1 |
| 4 | 5324 | 1981 | PhD | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 422 | Partner | 1 |
| 5 | 7446 | 1967 | Master | 62513.0 | 0 | 1 | 2013-09-09 | 16 | 520 | 42 | ... | 0 | 0 | 0 | 0 | 0 | 56 | 1 | 716 | Partner | 1 |
4 rows × 29 columns
data_partner = data_partner.sample(frac=1)  # shuffle the rows; pass random_state=... to make this reproducible
data_partner.head(4)
| ID | Year_Birth | Education | Income | Kids home | Teens home | Dt_Customer | Last Bought | Wines | Fruits | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | Companion | has_kids | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1685 | 7947 | 1969 | Graduation | 42231.0 | 1 | 1 | 2014-03-25 | 99 | 24 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 54 | 2 | 37 | Partner | 1 |
| 1229 | 833 | 1955 | Master | 38452.0 | 1 | 1 | 2014-03-30 | 62 | 56 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 68 | 2 | 72 | Partner | 1 |
| 280 | 4669 | 1981 | Basic | 24480.0 | 1 | 0 | 2013-11-02 | 46 | 4 | 19 | ... | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 102 | Partner | 1 |
| 147 | 3120 | 1981 | Graduation | 38547.0 | 1 | 0 | 2013-08-28 | 49 | 6 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 42 | 1 | 22 | Partner | 1 |
4 rows × 29 columns
data_partner=data_partner[0:786]
Here, we randomly took 786 "Partner" datapoints (the first 786 rows after shuffling) to match the number of "Alone" datapoints.
frames=[data_alone, data_partner]
data_EDA=pd.concat(frames)
data_EDA=data_EDA.sample(frac=1)
data_EDA.head(4)
| ID | Year_Birth | Education | Income | Kids home | Teens home | Dt_Customer | Last Bought | Wines | Fruits | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | Companion | has_kids | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 614 | 10299 | 1969 | PhD | 48240.0 | 0 | 0 | 2012-09-11 | 73 | 389 | 91 | ... | 0 | 0 | 0 | 0 | 0 | 54 | 0 | 882 | Alone | 0 |
| 249 | 8932 | 1969 | Master | 65176.0 | 0 | 1 | 2012-10-29 | 57 | 960 | 28 | ... | 0 | 0 | 0 | 0 | 0 | 54 | 1 | 1531 | Partner | 1 |
| 543 | 5547 | 1982 | PhD | 84169.0 | 0 | 0 | 2013-07-08 | 9 | 1478 | 19 | ... | 0 | 1 | 1 | 0 | 0 | 41 | 0 | 1919 | Partner | 0 |
| 531 | 2004 | 1969 | Graduation | 72679.0 | 0 | 1 | 2013-09-18 | 65 | 619 | 54 | ... | 0 | 0 | 0 | 0 | 0 | 54 | 1 | 1168 | Alone | 1 |
4 rows × 29 columns
data_EDA["Companion"].value_counts()
Alone      786
Partner    786
Name: Companion, dtype: int64
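The shuffle-and-slice balancing above can also be done in one reproducible call with pandas' GroupBy.sample; a sketch on a small imbalanced toy frame:

```python
import pandas as pd

# Imbalanced toy frame: 3 "Alone" rows vs 6 "Partner" rows
df = pd.DataFrame({
    "Companion": ["Alone"] * 3 + ["Partner"] * 6,
    "Total_Spent": range(9),
})

# Down-sample every group to the size of the smallest one, reproducibly
n_min = df["Companion"].value_counts().min()
balanced = df.groupby("Companion").sample(n=n_min, random_state=42)
print(sorted(balanced["Companion"].value_counts().to_dict().items()))  # -> [('Alone', 3), ('Partner', 3)]
```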
data.drop(["ID", "Year_Birth", "Kids home", "Teens home"], axis=1, inplace=True)
The .dt accessor lets us extract the year, month, and day from a datetime column.
data["Year"]=data["Dt_Customer"].dt.year
data["Month"]=data["Dt_Customer"].dt.month
data["Day"]=data["Dt_Customer"].dt.day
data.drop("Dt_Customer", axis=1, inplace=True)
# Compare total spend per product category between Alone and Partner customers
for col in ["Wines", "Fruits", "Fish", "Meat", "Gold"]:
    grouped = data_EDA.groupby(by="Companion", as_index=False)[col].sum()
    fig = px.bar(grouped, x="Companion", y=col, height=400, width=400, title=f"{col} Vs Companion")
    fig.show()
data["Education"].unique()
array(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'], dtype=object)
edu_map = {
    "Basic": "Undergraduate",
    "2n Cycle": "Undergraduate",
    "Graduation": "Graduate",
    "Master": "Postgraduate",
    "PhD": "Postgraduate"
}
data["Education"] = data["Education"].map(edu_map)
data_EDA["Education"] = data_EDA["Education"].map(edu_map)
fig = px.violin(data_EDA, x="Education", y="Total_Spent", height=500, width=800, title="Education Vs Total Sales")
fig.show()
fig = px.violin(data_EDA, x="Education", y="Total_Spent", color="Companion",
height=500, width=800, title="Education Vs Total Sales with respect to Companionship")
fig.show()
fig = px.violin(data_EDA, x="Education", y="Total_Spent", color="has_kids",
height=500, width=800, title="Education Vs Total Sales with respect to having or not having kids")
fig.show()
fig = px.scatter(data_EDA, x="Income", y="Total_Spent", height=500, width=800, title="Income Vs Total Spending", trendline="ols")
fig.update_layout(yaxis_range=[-100,3000])
fig.update_layout(xaxis_range=[0,110000])
fig.show()
my_dict = {"log_x": True}
fig = px.scatter(data_EDA, x="Income", y="Total_Spent", height=500, width=800, title="Income Vs Total Spending", trendline="ols",trendline_options=my_dict)
fig.update_layout(yaxis_range=[-100,3000])
fig.update_layout(xaxis_range=[0,110000])
fig.show()
data_EDA["Age"].nunique()
58
# sum() (not count()) so the y-axis actually measures web visits, as the title says
data_age = data_EDA.groupby("Age", as_index=False)["Online Visits monthly"].sum()
fig = px.bar(data_age, x="Age", y="Online Visits monthly", height=500, width=800, title="Age Vs Number of Web Visits per month")
fig.show()
data.corr(numeric_only=True)  # exclude the remaining object columns (Education, Companion)
| Income | Last Bought | Wines | Fruits | Meat | Fish | Sweets | Gold | Deals Buying | Online buyings | ... | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | has_kids | Year | Month | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Income | 1.000000 | -0.003970 | 0.578650 | 0.430842 | 0.584633 | 0.438871 | 0.440744 | 0.325916 | -0.083101 | 0.387878 | ... | 0.335943 | 0.276820 | 0.087545 | 0.161791 | -0.293352 | 0.667576 | -0.338153 | 0.022451 | -0.013887 | -0.031473 |
| Last Bought | -0.003970 | 1.000000 | 0.015721 | -0.005844 | 0.022518 | 0.000551 | 0.025110 | 0.017663 | 0.002115 | -0.005641 | ... | -0.000482 | -0.021061 | -0.001400 | 0.016295 | 0.018290 | 0.020066 | 0.002485 | -0.027064 | -0.004930 | 0.018781 |
| Wines | 0.578650 | 0.015721 | 1.000000 | 0.387024 | 0.568860 | 0.397721 | 0.390326 | 0.392731 | 0.008886 | 0.553786 | ... | 0.473550 | 0.351417 | 0.206185 | 0.159451 | -0.353748 | 0.893136 | -0.343094 | -0.154991 | 0.039186 | 0.000058 |
| Fruits | 0.430842 | -0.005844 | 0.387024 | 1.000000 | 0.547822 | 0.593431 | 0.571606 | 0.396487 | -0.134512 | 0.302039 | ... | 0.212871 | 0.191816 | -0.009980 | 0.017747 | -0.395901 | 0.613249 | -0.411963 | -0.054961 | 0.000414 | -0.021932 |
| Meat | 0.584633 | 0.022518 | 0.568860 | 0.547822 | 1.000000 | 0.573574 | 0.535136 | 0.359446 | -0.121308 | 0.307090 | ... | 0.376867 | 0.313076 | 0.043521 | 0.033697 | -0.504545 | 0.845884 | -0.574931 | -0.078562 | 0.030105 | -0.019195 |
| Fish | 0.438871 | 0.000551 | 0.397721 | 0.593431 | 0.573574 | 1.000000 | 0.583867 | 0.427142 | -0.143241 | 0.299688 | ... | 0.196277 | 0.261608 | 0.002345 | 0.040425 | -0.427841 | 0.642371 | -0.450318 | -0.067327 | -0.011281 | -0.015993 |
| Sweets | 0.440744 | 0.025110 | 0.390326 | 0.571606 | 0.535136 | 0.583867 | 1.000000 | 0.357450 | -0.121432 | 0.333937 | ... | 0.259230 | 0.245102 | 0.010188 | 0.020204 | -0.389411 | 0.607062 | -0.402722 | -0.073794 | 0.006082 | 0.001321 |
| Gold | 0.325916 | 0.017663 | 0.392731 | 0.396487 | 0.359446 | 0.427142 | 0.357450 | 1.000000 | 0.051905 | 0.407066 | ... | 0.181397 | 0.170132 | 0.050734 | 0.064208 | -0.268918 | 0.528708 | -0.247433 | -0.143728 | 0.020835 | 0.001498 |
| Deals Buying | -0.083101 | 0.002115 | 0.008886 | -0.134512 | -0.121308 | -0.143241 | -0.121432 | 0.051905 | 1.000000 | 0.241440 | ... | -0.184253 | -0.127374 | -0.037981 | 0.058668 | 0.436076 | -0.065854 | 0.388425 | -0.185314 | -0.002327 | -0.002719 |
| Online buyings | 0.387878 | -0.005641 | 0.553786 | 0.302039 | 0.307090 | 0.299688 | 0.333937 | 0.407066 | 0.241440 | 1.000000 | ... | 0.141189 | 0.159292 | 0.034829 | 0.153051 | -0.148871 | 0.528973 | -0.074008 | -0.169698 | 0.021741 | 0.007970 |
| Catalog Buying | 0.589162 | 0.024081 | 0.634753 | 0.486263 | 0.734127 | 0.532757 | 0.495136 | 0.442428 | -0.012118 | 0.386868 | ... | 0.322471 | 0.309026 | 0.099915 | 0.121764 | -0.443474 | 0.780482 | -0.453470 | -0.086008 | 0.003736 | -0.021842 |
| Store buyings | 0.529362 | -0.000434 | 0.640012 | 0.458491 | 0.486006 | 0.457745 | 0.455225 | 0.389180 | 0.066107 | 0.516240 | ... | 0.212954 | 0.178743 | 0.085271 | 0.127891 | -0.323213 | 0.675181 | -0.284927 | -0.097526 | -0.000077 | 0.000177 |
| Online Visits monthly | -0.553088 | -0.018564 | -0.321978 | -0.418729 | -0.539484 | -0.446423 | -0.422371 | -0.247691 | 0.346048 | -0.051226 | ... | -0.277883 | -0.194773 | -0.007362 | -0.123904 | 0.416076 | -0.499082 | 0.476234 | -0.253336 | 0.033645 | 0.039725 |
| AcceptedCmp3 | -0.016174 | -0.032257 | 0.061463 | 0.014424 | 0.018438 | -0.000219 | 0.001780 | 0.124958 | -0.023135 | 0.042958 | ... | 0.080248 | 0.095683 | 0.071702 | -0.061784 | -0.019376 | 0.053041 | -0.005507 | 0.011011 | -0.010984 | -0.003730 |
| AcceptedCmp4 | 0.184400 | 0.017566 | 0.373143 | 0.006396 | 0.091618 | 0.016105 | 0.029313 | 0.024015 | 0.016077 | 0.162932 | ... | 0.311314 | 0.242782 | 0.295050 | 0.066109 | -0.088254 | 0.248805 | -0.076907 | -0.009210 | -0.009242 | 0.005925 |
| AcceptedCmp5 | 0.335943 | -0.000482 | 0.473550 | 0.212871 | 0.376867 | 0.196277 | 0.259230 | 0.181397 | -0.184253 | 0.141189 | ... | 1.000000 | 0.407878 | 0.222121 | -0.010575 | -0.285761 | 0.470278 | -0.348173 | 0.021230 | 0.005544 | -0.044069 |
| AcceptedCmp1 | 0.276820 | -0.021061 | 0.351417 | 0.191816 | 0.313076 | 0.261608 | 0.245102 | 0.170132 | -0.127374 | 0.159292 | ... | 0.407878 | 1.000000 | 0.176637 | 0.009611 | -0.230068 | 0.380825 | -0.279174 | 0.037536 | -0.007580 | 0.001791 |
| AcceptedCmp2 | 0.087545 | -0.001400 | 0.206185 | -0.009980 | 0.043521 | 0.002345 | 0.010188 | 0.050734 | -0.037981 | 0.034829 | ... | 0.222121 | 0.176637 | 1.000000 | 0.006717 | -0.069955 | 0.136161 | -0.081522 | 0.000838 | -0.017265 | 0.025189 |
| Age | 0.161791 | 0.016295 | 0.159451 | 0.017747 | 0.033697 | 0.040425 | 0.020204 | 0.064208 | 0.058668 | 0.153051 | ... | -0.010575 | 0.009611 | 0.006717 | 1.000000 | 0.087398 | 0.113487 | -0.012448 | 0.027288 | -0.006652 | 0.000813 |
| Children | -0.293352 | 0.018290 | -0.353748 | -0.395901 | -0.504545 | -0.427841 | -0.389411 | -0.268918 | 0.436076 | -0.148871 | ... | -0.285761 | -0.230068 | -0.069955 | 0.087398 | 1.000000 | -0.500244 | 0.799805 | 0.031054 | 0.003521 | -0.006526 |
| Total_Spent | 0.667576 | 0.020066 | 0.893136 | 0.613249 | 0.845884 | 0.642371 | 0.607062 | 0.528708 | -0.065854 | 0.528973 | ... | 0.470278 | 0.380825 | 0.136161 | 0.113487 | -0.500244 | 1.000000 | -0.522629 | -0.143066 | 0.034332 | -0.009789 |
| has_kids | -0.338153 | 0.002485 | -0.343094 | -0.411963 | -0.574931 | -0.450318 | -0.402722 | -0.247433 | 0.388425 | -0.074008 | ... | -0.348173 | -0.279174 | -0.081522 | -0.012448 | 0.799805 | -0.522629 | 1.000000 | 0.004366 | 0.001809 | -0.005457 |
| Year | 0.022451 | -0.027064 | -0.154991 | -0.054961 | -0.078562 | -0.067327 | -0.073794 | -0.143728 | -0.185314 | -0.169698 | ... | 0.021230 | 0.037536 | 0.000838 | 0.027288 | 0.031054 | -0.143066 | 0.004366 | 1.000000 | -0.367270 | -0.087528 |
| Month | -0.013887 | -0.004930 | 0.039186 | 0.000414 | 0.030105 | -0.011281 | 0.006082 | 0.020835 | -0.002327 | 0.021741 | ... | 0.005544 | -0.007580 | -0.017265 | -0.006652 | 0.003521 | 0.034332 | 0.001809 | -0.367270 | 1.000000 | -0.018416 |
| Day | -0.031473 | 0.018781 | 0.000058 | -0.021932 | -0.019195 | -0.015993 | 0.001321 | 0.001498 | -0.002719 | 0.007970 | ... | -0.044069 | 0.001791 | 0.025189 | 0.000813 | -0.006526 | -0.009789 | -0.005457 | -0.087528 | -0.018416 | 1.000000 |
25 rows × 25 columns
# Set up colors preferences
pallet = ["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"]
color_discrete_map = {0: pallet[0], 1: pallet[-1]}  # has_kids holds ints 0/1, not the strings "False"/"True"
# Select features to plot
plot = ["Income", "Last Bought", "Age", "Total_Spent", "has_kids"]
# Create pairplot using plotly express
fig = px.scatter_matrix(data[plot], dimensions=plot[:-1], color="has_kids",
color_discrete_map=color_discrete_map, opacity=0.7)
# Update the layout of the figure
fig.update_layout(plot_bgcolor="#FFF9ED", paper_bgcolor="#FFF9ED", height=800)
# Show the figure
fig.show()
data.head()
| Education | Income | Last Bought | Wines | Fruits | Meat | Fish | Sweets | Gold | Deals Buying | ... | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | Companion | has_kids | Year | Month | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduate | 58138.0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | ... | 0 | 0 | 66 | 0 | 1617 | Alone | 0 | 2012 | 4 | 9 |
| 1 | Graduate | 46344.0 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | ... | 0 | 0 | 69 | 2 | 27 | Alone | 1 | 2014 | 8 | 3 |
| 2 | Graduate | 71613.0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | ... | 0 | 0 | 58 | 0 | 776 | Partner | 0 | 2013 | 8 | 21 |
| 3 | Graduate | 26646.0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | ... | 0 | 0 | 39 | 1 | 53 | Partner | 1 | 2014 | 10 | 2 |
| 4 | Postgraduate | 58293.0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | ... | 0 | 0 | 42 | 1 | 422 | Partner | 1 | 2014 | 1 | 19 |
5 rows × 27 columns
data_EDA["Year"]=data_EDA["Dt_Customer"].dt.year
data_EDA["Month"]=data_EDA["Dt_Customer"].dt.month
data_EDA["Day"]=data_EDA["Dt_Customer"].dt.day
data_companion=data.groupby(by=["Month", "Companion"], as_index=False)["Total_Spent"].sum()
sns.barplot(x='Month', y='Total_Spent', hue='Companion', data=data_companion, palette=["#682F2F", "#F3AB60"])
plt.title('Total Spent by Month and Companion')
plt.xlabel('Month')
plt.ylabel('Total Spent')
plt.show()
data_haskids=data.groupby(by=["Month", "has_kids"], as_index=False)["Total_Spent"].sum()
sns.barplot(x='Month', y='Total_Spent', hue='has_kids', data=data_haskids, palette=["#682F2F", "#F3AB60"])
plt.title('Total Spent by Month and Has kids or not')
plt.xlabel('Month')
plt.ylabel('Total Spent')
plt.show()
data.head()
| Education | Income | Last Bought | Wines | Fruits | Meat | Fish | Sweets | Gold | Deals Buying | ... | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | Companion | has_kids | Year | Month | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduate | 58138.0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | ... | 0 | 0 | 66 | 0 | 1617 | Alone | 0 | 2012 | 4 | 9 |
| 1 | Graduate | 46344.0 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | ... | 0 | 0 | 69 | 2 | 27 | Alone | 1 | 2014 | 8 | 3 |
| 2 | Graduate | 71613.0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | ... | 0 | 0 | 58 | 0 | 776 | Partner | 0 | 2013 | 8 | 21 |
| 3 | Graduate | 26646.0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | ... | 0 | 0 | 39 | 1 | 53 | Partner | 1 | 2014 | 10 | 2 |
| 4 | Postgraduate | 58293.0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | ... | 0 | 0 | 42 | 1 | 422 | Partner | 1 | 2014 | 1 | 19 |
5 rows × 27 columns
education = data["Education"].unique()
edu_codes = {key: index for index, key in enumerate(education)}
data["Education"] = data["Education"].map(edu_codes)
companion_codes = {"Alone": 0, "Partner": 1}
data["Companion"] = data["Companion"].map(companion_codes)
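One caveat: enumerate() over unique() assigns integer codes by order of appearance, not by education level, and K-Means will treat those codes as distances. If no ordering is intended, one-hot encoding is a common alternative; a small sketch on toy values:

```python
import pandas as pd

df = pd.DataFrame({"Education": ["Graduate", "Postgraduate", "Undergraduate", "Graduate"]})

# One-hot encoding avoids telling K-Means that Postgraduate is "twice" Undergraduate
dummies = pd.get_dummies(df["Education"], prefix="Edu")
print(sorted(dummies.columns))  # -> ['Edu_Graduate', 'Edu_Postgraduate', 'Edu_Undergraduate']
```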
data.head()
| Education | Income | Last Bought | Wines | Fruits | Meat | Fish | Sweets | Gold | Deals Buying | ... | AcceptedCmp1 | AcceptedCmp2 | Age | Children | Total_Spent | Companion | has_kids | Year | Month | Day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58138.0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | ... | 0 | 0 | 66 | 0 | 1617 | 0 | 0 | 2012 | 4 | 9 |
| 1 | 0 | 46344.0 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | ... | 0 | 0 | 69 | 2 | 27 | 0 | 1 | 2014 | 8 | 3 |
| 2 | 0 | 71613.0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | ... | 0 | 0 | 58 | 0 | 776 | 1 | 0 | 2013 | 8 | 21 |
| 3 | 0 | 26646.0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | ... | 0 | 0 | 39 | 1 | 53 | 1 | 1 | 2014 | 10 | 2 |
| 4 | 1 | 58293.0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | ... | 0 | 0 | 42 | 1 | 422 | 1 | 1 | 2014 | 1 | 19 |
5 rows × 27 columns
For this dataset we are going to use a clustering algorithm, an unsupervised learning technique that does not necessarily need a target label.
The goal is to build a pipeline that can take different algorithms, find the best parameters for each, and report a performance score so we can compare the algorithms and select the best-performing model.
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import rand_score
This function was written as an automated pipeline for trying out various clustering algorithms on the dataset, comparing their scores, and choosing the best fit for the data. However, it put too much load on the CPU, most likely because of the reshape(-1, 1) call, which turns the 2-D feature matrix into one enormous single-feature column (silhouette_score also cannot be passed directly as a GridSearchCV scorer; see the note at the end of this notebook). I have therefore commented this region out and will proceed with a more straightforward but longer method.
# def my_pipeline(data):
# # Define the pipelines to use for each algorithm
# param_grids = [
# {
# 'kmeans__n_clusters': [2, 4, 6, 8],
# 'kmeans__init': ['k-means++', 'random'],
# 'kmeans__max_iter': [100, 200, 300],
# 'kmeans__n_init': [10, 20, 30]
# },
# {
# 'agg_clustering__n_clusters': [2, 4, 6, 8],
# 'agg_clustering__linkage': ['ward', 'complete', 'average']
# }
# ]
# algorithms = [
# ('kmeans', KMeans()),
# ('agg_clustering', AgglomerativeClustering())
# ]
# scaler= StandardScaler()
# pipelines = [
# Pipeline([
# ('scaler', scaler),
# ('kmeans', KMeans())
# ]),
# Pipeline([
# ('scaler', scaler),
# ('agg_clustering', AgglomerativeClustering())
# ]),
# ]
# scorer=silhouette_score
# # Combine the algorithm, parameter, and pipeline definitions into a single list of tuples
# estimators = list(zip(algorithms, pipelines, param_grids))
# data_arr = np.array(data)
# # keep data_arr 2-D (n_samples, n_features); reshaping it to (-1, 1) flattens
# # the feature matrix into one huge single-feature column and overloads the search
# # Loop over each estimator and hypertune its parameters using GridSearchCV
# for estimator, pipeline, param_grid in estimators:
# grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, n_jobs=-1, scoring=scorer)
# grid_search.fit(data_arr)
# print(f"Best hyperparameters for {estimator[0]}: {grid_search.best_params_}")
# print(f"Best silhouette score for {estimator[0]}: {grid_search.best_score_}")
# # Use the best estimator to make predictions and create a visualization
# best_estimator = grid_search.best_estimator_
# if estimator[0] == 'kmeans':  # note: `estimator is KMeans()` is always False (identity check against a fresh object)
# y_pred = best_estimator.predict(data)
# else:
# y_pred = best_estimator.fit(data)
# # Create a scatter plot of the clustering results
# #plt.scatter(data_arr[:, 0], data_arr[:, 1], c=y_pred, cmap='viridis')
# #plt.title(f"{estimator[0]} Clustering Results")
# #plt.xlabel('Feature 1')
# #plt.ylabel('Feature 2')
# #plt.show()
# Cluster on all remaining (now fully numeric) features
dx = data.copy()
stsc = StandardScaler()
dx_scale = stsc.fit_transform(dx)
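As a quick sanity check of the scaling step, each standardized column should end up with mean ~0 and standard deviation ~1; a sketch on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on wildly different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = StandardScaler().fit_transform(X)

# After standardization every column has mean ~0 and (population) std ~1
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # -> True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # -> True
```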
Feature scaling transforms the input features so that their scales are comparable. Common methods include standardization and normalization: standardization subtracts each feature's mean and divides by its standard deviation, whereas normalization scales each feature to a fixed range such as [0, 1] or [-1, 1].
Once the features have been rescaled, PCA can be applied. PCA transforms the scaled features into a new set of variables, the principal components, which capture the most essential information in the original dataset.
PCA also makes the data much easier to visualize.
pca = PCA(n_components=2)
dx_dim_red = pca.fit_transform(dx_scale)
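It is worth checking how much variance the two components retain, via PCA's explained_variance_ratio_ attribute; a sketch on synthetic data whose true dimensionality is 2:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 columns that really carry only 2 directions of variation
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
M = np.array([[1.0, 0.5, -0.5, 2.0], [0.5, 1.0, 2.0, -1.0]])
X = np.hstack([base, base @ M + 0.05 * rng.normal(size=(200, 4))])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# explained_variance_ratio_ tells us how much information the 2 components keep
retained = pca.explained_variance_ratio_.sum()
print(retained > 0.9)  # -> True
```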
trainx, testx = train_test_split(dx_dim_red, test_size=0.2, random_state=123)
# Define the pipeline
pipeline = Pipeline([
('kmeans', KMeans())
])
params = {
'kmeans__n_clusters': range(2, 11),
'kmeans__init': ['k-means++', 'random'],
'kmeans__max_iter': [100, 200, 300, 400, 500],
}
The pipeline supplies the algorithm; params lists the hyperparameters for GridSearchCV to tune.
g = GridSearchCV(pipeline, param_grid=params, cv=5, n_jobs=-1)
g.fit(trainx)
results = -g.best_score_  # KMeans's default score is negative inertia, so negate it to get a positive value
# Print best parameters
print("Best parameters: ", g.best_params_)
print("Best score:", results)
Best parameters: {'kmeans__init': 'random', 'kmeans__max_iter': 300, 'kmeans__n_clusters': 10}
Best score: 280.2350367134206
km = KMeans(n_clusters=g.best_params_['kmeans__n_clusters'], init=g.best_params_['kmeans__init'], max_iter=g.best_params_['kmeans__max_iter'])
km.fit(trainx)
clust_target1=km.predict(trainx)
# Take the two PCA components as the plot axes
x = trainx[:,0]
y = trainx[:,1]
# Make a scatter plot of xs and ys, using labels to define the colors
fig=px.scatter(x=x,y=y, color=clust_target1, height=400, width=400)
fig.show()
clust_target1=km.predict(testx)
# Take the two PCA components as the plot axes
x = testx[:,0]
y = testx[:,1]
# Make a scatter plot of xs and ys, using labels to define the colors
fig=px.scatter(x=x,y=y, color=clust_target1, height=400, width=400)
fig.show()
As the model clusters the test set in the same way as the training set, it appears to be working consistently. Note, however, that GridSearchCV returned n_clusters=10 because KMeans's default score is negative inertia, which always improves as the number of clusters grows, so the grid search will simply favour the largest k on offer. Let's therefore also consider the elbow method, which is very common for K-Means clustering.
Here I will use an attribute of the Kmeans algorithm that is called (.inertia_). This attribute gives back the Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init=g.best_params_['kmeans__init'], max_iter=g.best_params_['kmeans__max_iter'])
    km.fit(trainx)
    inertias.append(km.inertia_)
plt.plot(range(1, 11), inertias)
plt.title('Elbow curve')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
The elbow curve suggests a much smaller k, so let's also fit the model with k=3 on the training data.
km = KMeans(n_clusters=3, init=g.best_params_['kmeans__init'], max_iter=g.best_params_['kmeans__max_iter'])
km.fit(trainx)
clust_target1=km.predict(trainx)
# Assign the columns of new_points: xs and ys
x = trainx[:,0]
y = trainx[:,1]
# Make a scatter plot of xs and ys, using labels to define the colors
fig=px.scatter(x=x,y=y, color=clust_target1, height=400, width=400)
fig.show()
clust_target1=km.predict(testx)
# Assign the columns of new_points: xs and ys
x = testx[:,0]
y = testx[:,1]
# Make a scatter plot of xs and ys, using labels to define the colors
fig=px.scatter(x=x,y=y, color=clust_target1, height=400, width=400)
fig.show()
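Besides the elbow curve, the silhouette score offers a second check on k: unlike inertia, it penalises both too few and too many clusters, so it peaks rather than decreasing monotonically. A hedged sketch on synthetic blobs (standing in for `trainx`; the three-centre data is an assumption for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy 2-D data standing in for the PCA-reduced training set
X, _ = make_blobs(n_samples=300, centers=3, random_state=123)

# Silhouette score for each candidate k; the peak marks a well-balanced k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the real `trainx` this gives an independent sanity check on whatever k the elbow curve suggests.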
For the linkage parameter:
'ward' minimises the variance of the merged clusters.
'average' uses the mean of the distances between all observations of the two clusters.
'complete' (or 'maximum') uses the largest distance between all observations of the two clusters.
'single' uses the smallest distance between all observations of the two clusters.
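A small sketch on synthetic data shows how each linkage criterion can partition the same points (the toy blobs are an assumption; on well-separated data all four often agree, while on irregular shapes they diverge):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data standing in for the PCA-reduced features
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Fit the same data with each linkage and compare the cluster sizes
results = {}
for linkage in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    results[linkage] = labels
    print(linkage, np.bincount(labels))
```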
pipeline2 = Pipeline([
('aglo', AgglomerativeClustering(compute_full_tree=True))
])
params2 = {
'aglo__n_clusters': [2, 3],
'aglo__linkage': ['ward', 'complete', 'average']
}
results2 = silhouette_score
g2 = GridSearchCV(pipeline2, param_grid=params2, cv=5, n_jobs=-1, scoring= results2)
g2.fit(trainx)
# Print best parameters
print("Best parameters: ", g2.best_params_)
print("Best score:", g2.best_score_)
# The scoring argument fails: every fold scores nan and a long error message is printed, so the
# reported best parameters here are not meaningful. The real cause: GridSearchCV expects scoring
# to be a scorer with signature scorer(estimator, X), but silhouette_score is a metric with
# signature (X, labels). GridSearchCV therefore passes the fitted estimator where X should be,
# which is why the traceback shows array=[AgglomerativeClustering(...)] as a 1-element 1D array.
Best parameters: {'aglo__linkage': 'ward', 'aglo__n_clusters': 2}
Best score: nan
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
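One way to make silhouette scoring work inside GridSearchCV is a custom callable with the scorer signature `scorer(estimator, X)`. A hedged sketch on synthetic data (the blob data and parameter grid are illustrative; `fit_predict` is used because AgglomerativeClustering has no `predict` method):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def silhouette_scorer(estimator, X):
    # Adapter: GridSearchCV calls scorer(estimator, X); silhouette_score needs (X, labels)
    labels = estimator.fit_predict(X)
    return silhouette_score(X, labels)

# Toy data standing in for the PCA-reduced training set
X, _ = make_blobs(n_samples=200, centers=3, random_state=123)

pipe = Pipeline([('aglo', AgglomerativeClustering())])
grid = GridSearchCV(
    pipe,
    param_grid={'aglo__n_clusters': [2, 3], 'aglo__linkage': ['ward', 'average']},
    scoring=silhouette_scorer,
    cv=3,
)
grid.fit(X)
print(grid.best_params_)
print(grid.best_score_)
```

With this adapter the search returns a real silhouette score instead of nan, and the best parameters become trustworthy.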
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, n_clusters=3)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, n_clusters=3)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, n_clusters=3)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, n_clusters=3)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete')].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete',
n_clusters=3) ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete')].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, n_clusters=3)].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete')].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete')].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete')].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete',
n_clusters=3) ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete',
n_clusters=3) ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='complete',
n_clusters=3) ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='average')].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:770: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 759, in _score
scores = scorer(estimator, X_test)
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 212, in silhouette_samples
X, labels = check_X_y(X, labels, accept_sparse=["csc", "csr"])
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 964, in check_X_y
X = check_array(
File "/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 769, in check_array
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[AgglomerativeClustering(compute_full_tree=True, linkage='average')].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
warnings.warn(
[... the same UserWarning and traceback repeat for every remaining parameter candidate (linkage 'average' or 'complete', various n_clusters) ...]
/Users/ayushphukan/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_search.py:969: UserWarning:
One or more of the test scores are non-finite: [nan nan nan nan nan nan]
# Refit agglomerative clustering with the best parameters found by the grid search
aglo = AgglomerativeClustering(n_clusters=g2.best_params_['aglo__n_clusters'],
                               linkage=g2.best_params_['aglo__linkage'])
aglo.fit(trainx)
clust_target2 = aglo.labels_  # AgglomerativeClustering has no predict(); use labels_

# Scatter plot of the first two (PCA) components, coloured by cluster label
x = trainx[:, 0]
y = trainx[:, 1]
fig = px.scatter(x=x, y=y, color=clust_target2, height=400, width=400)
fig.show()
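For context, the traceback suggests the scoring step is the problem: `GridSearchCV` calls `scorer(estimator, X_test)`, and since `AgglomerativeClustering` has no `predict` method, the silhouette scorer appears to end up receiving the estimator object itself where the data array is expected (hence "Expected 2D array ... array=[AgglomerativeClustering(...)]"). A minimal sketch of a manual parameter search that avoids `GridSearchCV` entirely (here `trainx` is stubbed with synthetic blobs; in the real notebook it would be the scaled/PCA-reduced array):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for trainx; replace with the real PCA-reduced data
trainx, _ = make_blobs(n_samples=300, centers=3, random_state=0)

best = (None, -1.0)
for linkage in ["average", "complete", "ward"]:
    for k in [2, 3, 4, 5]:
        # fit_predict returns the cluster label of every row
        labels = AgglomerativeClustering(n_clusters=k, linkage=linkage).fit_predict(trainx)
        score = silhouette_score(trainx, labels)
        if score > best[1]:
            best = ((linkage, k), score)

print(best)  # best (linkage, n_clusters) pair by silhouette score
```

Because `silhouette_score` is computed directly on the fitted labels, no `predict` method is needed and the nan scores disappear.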
Also, after applying PCA the dimensionality of the dataset changes completely, so it is hard to tell which cluster each row of the original dataset belongs to, which makes it difficult to profile the population of a specific cluster. I wrote code to extract a separate dataset for each cluster and was successful in that, but when I then tried to reverse the PCA on each cluster to recover the actual scaled data points, it didn't work.
I took help from https://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com to do it, but the method expects a dataset with the original number of columns. I also tried running the model without applying PCA, but with so many features it was unable to visualise the clusters properly.
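One possible way around the inverse-PCA problem (a sketch, assuming the fitted `PCA` object is still in scope and that row order is preserved, which `fit_transform` does) is to skip reconstruction entirely: the rows of the reduced array line up one-to-one with the rows of the original scaled dataframe, so the cluster labels can simply be attached back to the original data. `pca.inverse_transform` also works, but it only approximates the original features and must be called on the n_components-column reduced array, not on a per-cluster slice of original width:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in for the scaled feature matrix (the real one has many more columns)
X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

pca = PCA(n_components=2)
reduced = pca.fit_transform(df)  # rows stay in the original order

labels = AgglomerativeClustering(n_clusters=3).fit_predict(reduced)

# Attach labels to the ORIGINAL data -- no inverse PCA needed
df["cluster"] = labels
profile = df.groupby("cluster").mean()  # per-cluster profile in original units

# inverse_transform yields only an approximation of the original features,
# and it expects the 2-column reduced array as input
approx = pca.inverse_transform(reduced[labels == 0])
print(profile.shape, approx.shape)
```

With the labels joined back onto the original dataframe, each cluster can be profiled directly (income, spend, kids, etc.) without ever reversing the PCA.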
Your feedback on this would be greatly appreciated, as I would really like to solve this issue.
Dear Professor Mohommad Mahdavi,
I am writing to express my deepest gratitude for giving me the opportunity to undertake a machine learning project as my final assessment in your data science class. It was an incredible learning experience, and I am honored to have had the opportunity to work on this project under your guidance.